The Difference Between Predictive Modeling and Regression
Abstract
Predictive modeling includes regression, both logistic and linear, depending on the type of outcome variable. However, because the datasets are generally so large that a p-value carries little meaning, predictive modeling relies on other measures of model fit. There are also usually enough observations to partition the data into two or more datasets. The first subset is used to define (or train) the model. The second subset can be used in an iterative process to improve the model. The third subset is used to test the model for accuracy.
The definition of the “best” model needs to be considered as well. In a regression model, the “best” model is the one that satisfies the criteria of the uniform minimum variance unbiased estimator. In other words, it is “best” only within the class of unbiased estimators. As soon as the class of estimators is expanded, a single “best” no longer exists, and we must define the criteria used to determine a “best” fit. There are several criteria to consider. For a binary outcome variable, we can use the misclassification rate. However, especially in medicine, misclassifications can carry different costs. A false positive error is not as costly as a false negative error if the outcome involves the diagnosis of a terminal disease. We will discuss the similarities and differences between the types of modeling.
INTRODUCTION
Regression has been the standard approach to modeling the relationship between one outcome variable and several input variables. Generally, the p-value is used as a measure of the adequacy of the model. Other statistics, such as r² and the c-statistic (for logistic regression), are presented but are not usually considered as important. However, regression has limitations with large samples: p-values become statistically significant even when the effect size is virtually zero. For this reason, we need to be careful when interpreting the model. Instead, we can take a different approach.
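The large-sample behavior of the p-value can be illustrated with a short calculation. The sketch below is only an illustration and not part of the original paper: it assumes a two-sample z-test with equal group sizes and a known common standard deviation, and the function name `two_sided_p_value` and the effect size of 0.01 standard deviations are hypothetical choices.

```python
import math


def two_sided_p_value(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test for a standardized mean difference.

    Assumes equal group sizes and a known common standard deviation, so the
    test statistic is z = effect_size * sqrt(n_per_group / 2).
    """
    z = effect_size * math.sqrt(n_per_group / 2)
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# The same negligible effect (0.01 standard deviations) at growing sample sizes.
tiny_effect = 0.01
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {two_sided_p_value(tiny_effect, n):.3g}")
```

Under these assumptions, an effect that is far from significant with 100 observations per group becomes highly significant with a million per group, even though its practical size has not changed.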
Because there are so many data values available, we can divide them and create holdout samples. Then, when using predictive modeling, we can fit many different models to the same data and compare them to find the one that performs best. We can use traditional regression, but also decision trees and neural networks. We can also combine different models. We can focus on accuracy of prediction rather than just identifying risk factors.
There is still limited use of predictive modeling in medical research, with the exception of regression models. Most of the use of predictive modeling is fairly recent (Sylvia et al., 2006). While most predictive models are used for examining costs (Powers, Meyer, Roebuck, & Vaziri, 2005), they can be invaluable in improving the quality of care (Hodgman, 2008; Tewari et al., 2001; Weber & Neeser, 2006; Whitlock & Johnston, 2006). In this way, predictive modeling can be used to target the patients at highest risk for more intensive case management (Weber & Neeser, 2006). It has also been used to examine workflow in the healthcare environment (Tropsha & Golbraikh, 2007). Some studies focus on particular types of models such as neural networks (Gamito & Crawford, 2004). In many cases, administrative (billing) data are used to identify patients who can benefit from interventions, and to identify patients who can benefit the most.
In particular, we will discuss some of the issues that are involved when using both linear and logistic regression. Standard regression inference relies on an assumption of normality. The definition of confidence intervals, too, requires normality. However, most healthcare data follow exponential or gamma distributions. According to the Central Limit Theorem, the sample mean can be assumed approximately normal if the sample is sufficiently large. However, if the distribution is exponential, just how large is large enough?
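One way to explore the question of how large is large enough is by simulation. The sketch below is a minimal illustration under stated assumptions, not a method from the paper: it draws repeated samples from an Exponential(1) distribution and measures how skewed the sample mean remains as the sample size grows. The function names, the seed, and the number of replications are hypothetical choices. For reference, the mean of n independent exponential draws has theoretical skewness 2/sqrt(n), while a normal distribution has skewness 0.

```python
import math
import random


def skewness(values):
    """Third standardized moment (population form) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n


def simulate_mean_skewness(n, replications=5000, seed=0):
    """Estimate the skewness of the mean of n Exponential(1) draws by simulation."""
    rng = random.Random(seed)
    means = [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(replications)
    ]
    return skewness(means)


# Compare the simulated skewness with the theoretical value 2 / sqrt(n).
for n in (5, 30, 100):
    simulated = simulate_mean_skewness(n)
    print(f"n = {n:>3}: simulated = {simulated:.3f}, theory = {2 / math.sqrt(n):.3f}")
```

Because the skewness shrinks only at the rate 1/sqrt(n), the common rule of thumb of n = 30 still leaves a theoretical skewness of roughly 0.37 for exponential data, so whether that is "large enough" depends on how much asymmetry the analysis can tolerate.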
If we use nonparametric models, we do not have to be as concerned with the actual population distribution. Also, we want to examine patient-level data rather than group-level data. That means we will want to include information about patient condition in any regression model.
Additional assumptions for regression are that the mean of the error term is equal to zero, and that the error term has equal variance across levels of the input (independent) variables. While the assumption of zero mean is almost always satisfied, the assumption of equal variance is not. Often, as the independent variables increase in value, the variance increases as well. Therefore, modifications to the variables are needed, usually in the form of transformations, such as substituting the log of an independent variable for the variable itself. Transformations require considerable experience to use properly. In addition, the independent
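The equal-variance issue and the effect of a log transformation can be sketched on simulated data. The text mentions replacing an independent variable with its log; the example below uses a related but different variant, taking the log of both the outcome and the input, which suits the multiplicative-error data that this sketch assumes. The data-generating model, the function names, and the crude diagnostic (residual spread in the upper half of x divided by the spread in the lower half) are all hypothetical choices made for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_spread_ratio(xs, ys):
    """Residual RMS for the upper half of x divided by that for the lower half.

    A ratio near 1 is consistent with equal variance; a ratio well above 1
    means the residual spread grows with x.
    """
    intercept, slope = fit_line(xs, ys)
    ordered = sorted((x, y - intercept - slope * x) for x, y in zip(xs, ys))
    residuals = [r for _, r in ordered]
    half = len(residuals) // 2

    def rms(values):
        return math.sqrt(sum(v * v for v in values) / len(values))

    return rms(residuals[half:]) / rms(residuals[:half])


# Assumed data: the outcome is proportional to x with multiplicative noise,
# so the spread of y grows as x grows.
rng = random.Random(1)
x = [rng.uniform(1.0, 100.0) for _ in range(2000)]
y = [2.0 * xi * math.exp(rng.gauss(0.0, 0.3)) for xi in x]

raw_ratio = residual_spread_ratio(x, y)
log_ratio = residual_spread_ratio([math.log(v) for v in x], [math.log(v) for v in y])
print(f"raw scale spread ratio: {raw_ratio:.2f}")
print(f"log scale spread ratio: {log_ratio:.2f}")
```

On the raw scale the ratio should come out well above 1, while on the log scale it should sit close to 1. Which transformation is appropriate for real data depends on how the variance actually changes, which is part of why the text notes that transformations take experience to use properly.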